The dataset consists of data collected from WHO (health factors, life expectancy) and economic data from the United Nation website and was published on Kaggle (https://www.kaggle.com/kumarajarshi/life-expectancy-who). It has been observed that in the past 15 years, there has been a huge development in the health sector resulting in improvement of human mortality rates especially in the developing nations in comparison to the past 30 years. Data from 2000-2015 for 193 countries has been collected and was merged together.
Most of the missing data was for population, Hepatitis B and GDP. The missing data were from less known countries like Vanuatu, Tonga, Togo, Cabo Verde etc. Finding all data for these countries was difficult and hence, it was decided that we exclude these countries from the final model data-set. The final merged file(final dataset) consists of 22 Columns and 2938 rows which meant 20 predicting variables. All predicting variables was then divided into several broad categories: Immunization related factors, Mortality factors, Economical factors and Social factors.
Predictor explanations:
Status: Is a country already developed or is it still developing?
Adult mortality: Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)
Infant deaths: Number of Infant Deaths per 1000 population
Alcohol: Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)
Percentage expenditure: Expenditure on health as a percentage of Gross Domestic Product per capita(%)
Hepatitis B: Hepatitis B (HepB) immunization coverage among 1-year-olds (%)
Measles: Measles - number of reported cases per 1000 population
BMI: Average Body Mass Index of entire population
Under five deaths: Number of under-five deaths per 1000 population
Polio: Polio (Pol3) immunization coverage among 1-year-olds (%)
Total expenditure: General government expenditure on health as a percentage of total government expenditure (%)
Diphtheria: Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)
HIV/AIDS: Deaths per 1 000 live births HIV/AIDS (0-4 years)
GDP: Gross Domestic Product per capita (in USD)
Population: Population of the country
Thinness 1-19: Prevalence of thinness among children and adolescents for Age 10 to 19 (% )
Thinness 5-9: Prevalence of thinness among children for Age 5 to 9(%)
Income comp: Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
Schooling: Number of years of Schooling(years)
For this assignment, the life expectancy will be predicted via a linear regression model and a regression model with random forests. This to find relaitonships within the life expectancy data set from the WHO.
library(tidyverse)
library(randomForest)
library(hexbin)
library(GGally)
library(ggplot2)
library(plotly)
library(ISLR)
library(MASS)
library(dplyr)
library(jtools)
life_expectancy <- read.csv("life-expectancy.csv")
dim(life_expectancy)
## [1] 2938 22
summary(life_expectancy)
## Country Year Status Life.expectancy
## Length:2938 Min. :2000 Length:2938 Min. :36.30
## Class :character 1st Qu.:2004 Class :character 1st Qu.:63.10
## Mode :character Median :2008 Mode :character Median :72.10
## Mean :2008 Mean :69.22
## 3rd Qu.:2012 3rd Qu.:75.70
## Max. :2015 Max. :89.00
## NA's :10
## Adult.Mortality infant.deaths Alcohol percentage.expenditure
## Min. : 1.0 Min. : 0.0 Min. : 0.0100 Min. : 0.000
## 1st Qu.: 74.0 1st Qu.: 0.0 1st Qu.: 0.8775 1st Qu.: 4.685
## Median :144.0 Median : 3.0 Median : 3.7550 Median : 64.913
## Mean :164.8 Mean : 30.3 Mean : 4.6029 Mean : 738.251
## 3rd Qu.:228.0 3rd Qu.: 22.0 3rd Qu.: 7.7025 3rd Qu.: 441.534
## Max. :723.0 Max. :1800.0 Max. :17.8700 Max. :19479.912
## NA's :10 NA's :194
## Hepatitis.B Measles BMI under.five.deaths
## Min. : 1.00 Min. : 0.0 Min. : 1.00 Min. : 0.00
## 1st Qu.:77.00 1st Qu.: 0.0 1st Qu.:19.30 1st Qu.: 0.00
## Median :92.00 Median : 17.0 Median :43.50 Median : 4.00
## Mean :80.94 Mean : 2419.6 Mean :38.32 Mean : 42.04
## 3rd Qu.:97.00 3rd Qu.: 360.2 3rd Qu.:56.20 3rd Qu.: 28.00
## Max. :99.00 Max. :212183.0 Max. :87.30 Max. :2500.00
## NA's :553 NA's :34
## Polio Total.expenditure Diphtheria HIV.AIDS
## Min. : 3.00 Min. : 0.370 Min. : 2.00 Min. : 0.100
## 1st Qu.:78.00 1st Qu.: 4.260 1st Qu.:78.00 1st Qu.: 0.100
## Median :93.00 Median : 5.755 Median :93.00 Median : 0.100
## Mean :82.55 Mean : 5.938 Mean :82.32 Mean : 1.742
## 3rd Qu.:97.00 3rd Qu.: 7.492 3rd Qu.:97.00 3rd Qu.: 0.800
## Max. :99.00 Max. :17.600 Max. :99.00 Max. :50.600
## NA's :19 NA's :226 NA's :19
## GDP Population thinness..1.19.years
## Min. : 1.68 Min. :3.400e+01 Min. : 0.10
## 1st Qu.: 463.94 1st Qu.:1.958e+05 1st Qu.: 1.60
## Median : 1766.95 Median :1.387e+06 Median : 3.30
## Mean : 7483.16 Mean :1.275e+07 Mean : 4.84
## 3rd Qu.: 5910.81 3rd Qu.:7.420e+06 3rd Qu.: 7.20
## Max. :119172.74 Max. :1.294e+09 Max. :27.70
## NA's :448 NA's :652 NA's :34
## thinness.5.9.years Income.composition.of.resources Schooling
## Min. : 0.10 Min. :0.0000 Min. : 0.00
## 1st Qu.: 1.50 1st Qu.:0.4930 1st Qu.:10.10
## Median : 3.30 Median :0.6770 Median :12.30
## Mean : 4.87 Mean :0.6276 Mean :11.99
## 3rd Qu.: 7.20 3rd Qu.:0.7790 3rd Qu.:14.30
## Max. :28.60 Max. :0.9480 Max. :20.70
## NA's :34 NA's :167 NA's :163
head(life_expectancy)
colnames(life_expectancy)[which(names(life_expectancy) == "Income.composition.of.resources")] <- "Income.comp"
life_expectancy$Status <- as.factor(life_expectancy$Status)
life_expectancy$Country <- as.factor(life_expectancy$Country)
str(life_expectancy)
## 'data.frame': 2938 obs. of 22 variables:
## $ Country : Factor w/ 193 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : int 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 ...
## $ Status : Factor w/ 2 levels "Developed","Developing": 2 2 2 2 2 2 2 2 2 2 ...
## $ Life.expectancy : num 65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
## $ Adult.Mortality : int 263 271 268 272 275 279 281 287 295 295 ...
## $ infant.deaths : int 62 64 66 69 71 74 77 80 82 84 ...
## $ Alcohol : num 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.03 ...
## $ percentage.expenditure: num 71.3 73.5 73.2 78.2 7.1 ...
## $ Hepatitis.B : int 65 62 64 67 68 66 63 64 63 64 ...
## $ Measles : int 1154 492 430 2787 3013 1989 2861 1599 1141 1990 ...
## $ BMI : num 19.1 18.6 18.1 17.6 17.2 16.7 16.2 15.7 15.2 14.7 ...
## $ under.five.deaths : int 83 86 89 93 97 102 106 110 113 116 ...
## $ Polio : int 6 58 62 67 68 66 63 64 63 58 ...
## $ Total.expenditure : num 8.16 8.18 8.13 8.52 7.87 9.2 9.42 8.33 6.73 7.43 ...
## $ Diphtheria : int 65 62 64 67 68 66 63 64 63 58 ...
## $ HIV.AIDS : num 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
## $ GDP : num 584.3 612.7 631.7 670 63.5 ...
## $ Population : num 33736494 327582 31731688 3696958 2978599 ...
## $ thinness..1.19.years : num 17.2 17.5 17.7 17.9 18.2 18.4 18.6 18.8 19 19.2 ...
## $ thinness.5.9.years : num 17.3 17.5 17.7 18 18.2 18.4 18.7 18.9 19.1 19.3 ...
## $ Income.comp : num 0.479 0.476 0.47 0.463 0.454 0.448 0.434 0.433 0.415 0.405 ...
## $ Schooling : num 10.1 10 9.9 9.8 9.5 9.2 8.9 8.7 8.4 8.1 ...
For this part, we will check whether there is a correlation between the different variables within the life expectancy dataset. First the variables will be converted to a numerical format.
data_var <- life_expectancy %>% select_if(is.numeric)
ggcorr(data_var, label = T, label_size = 2.2, label_round = 2, hjust = 1, size = 3.5, color = "black", layout.exp = 5, low = "red2", mid = "orange", high = "springgreen", name = "Correlation")
Life.expectancy is the strongest correlated with Income.comp (0.72) and Schooling (0.75), but it is negatively correlated with Adult.Mortality. If the mortality rate of the adult is high, the life expectancy of the people within this dataset will be low. But the strongest correlations are infant.deaths with under.five.deaths (exactly 1), thinness..1.19 years and thinness.5.9.years (0.94) and between percentage.expenditure and Gross Domestic Product per capita (GDP) (0.9). But we will create a scatterplot to visualise the strongest correlation of Life.expectancy, which is the correlation between between Life.expectancy and Schooling.
ggplot(data_var,
aes(x = Life.expectancy,
y = Schooling)) +
geom_point(aes(colour = Life.expectancy), size = 1)
## Warning: Removed 170 rows containing missing values (geom_point).
The Global Vaccine Action Plan (GVAP) has publicized the National Immunization Coverage Scorecards Estimates for 2018, in which the GVAP mentions their global immunization goals (https://www.who.int/publications/m/item/national-immunization-coverage-scorecards-estimates-for-2018). One of the goals is that all countries are to reach 90% national immunization coverage for diseases such as hepatitis B, diphtheria, tetanus, pertussis and polio. The World Health Organization states that the immunization coverage is a good indicator of health system performance (https://www.who.int/data/gho/indicator-metadata-registry/imr-details/95). In order to test how good of an indicator the immunization coverage is, visualizations will be made on the coverage of polio, diphtheria and hepatitis B.
life_expectancy_diseases <- life_expectancy %>% # Divide the diseases into two groups with ifelse()
mutate(Polio = ifelse(Polio < 90, "<90% covered", ">=90% covered"),
Polio = as.factor(Polio),
Diphtheria = ifelse(Diphtheria < 90, "<90% covered", ">=90% covered"),
Diphtheria = as.factor(Diphtheria),
Hepatitis.B = ifelse(Hepatitis.B < 90, "<90% covered", ">=90% covered"),
Hepatitis.B = as.factor(Hepatitis.B))
life_expectancy_diseases <- life_expectancy_diseases %>%
drop_na()
life_expectancy_diseases %>%
group_by(Polio) %>%
summarise(count = n(), .groups = 'drop') %>%
mutate(percentage = paste0(round(count/sum(count)*100, 2), "%"))
polio_plot <- ggplot(life_expectancy_diseases, aes(x=Polio, y = Life.expectancy, fill = Polio)) +
geom_boxplot() +
scale_fill_manual(values=c("red", "green2")) +
labs(x = "Polio Coverage", y = "Life Expectancy (Age)") +
theme(legend.position = "none")
ggplotly(polio_plot)
The polio coverage is at 57.55%, which means that there are unfortunately still many countries that do not have a polio coverage of 90% or above. The median for countries with a polio coverage of 90% of above is at the age of 73.5, while the median for countries with a polio coverage of 90% or below is at the age of 65.35. That is a gap of over 8 years.
life_expectancy_diseases %>%
group_by(Diphtheria) %>%
summarise(count = n(), .groups = 'drop') %>%
mutate(percentage = paste0(round(count/sum(count)*100, 2), "%"))
diphtheria_plot <- ggplot(life_expectancy_diseases, aes(x=Diphtheria, y = Life.expectancy, fill = Diphtheria)) +
geom_boxplot() +
scale_fill_manual(values=c("red", "green2")) +
labs(x = "Diphtheria Coverage", y = "Life Expectancy (Age)") +
theme(legend.position = "none")
ggplotly(diphtheria_plot)
The diphtheria coverage is at 57.31%, similar to that of polio. The median for countries with a diphtheria coverage of 90% of above is at the age of 73.6, while the median for countries with a polio coverage of 90% or below is at the age of 65.45. The diphtheria coverage is very similar to that of polio, which might indicate that vaccinations against polio and diphtheria are given at the same time.
life_expectancy_diseases %>%
group_by(Hepatitis.B) %>%
summarise(count = n(), .groups = 'drop') %>%
mutate(percentage = paste0(round(count/sum(count)*100, 2), "%"))
hepatitis_plot <- ggplot(life_expectancy_diseases, aes(x=Hepatitis.B, y = Life.expectancy, fill = Hepatitis.B)) +
geom_boxplot() +
scale_fill_manual(values=c("red", "green2")) +
labs(x = "Hepatitis B Coverage", y = "Life Expectancy (Age)") +
theme(legend.position = "none")
ggplotly(hepatitis_plot)
The amount of countries that have a Hepatitis B coverage of 90% or above seem to be making out more or less half of the observations. In the same trend as the polio and diphtheria coverage, the hepatitis B coverage shows a higher life expectancy in countries where the coverage is at 90% or above. As it turns out, the immunization coverage appears to be a good indicator of health system performance, as the life expectancy is higher in countries where the coverage of diseases is at 90% or above.
Life expectancy is the highest for those who had been educated beyond high school. So we can conclude that most of the older people in this dataset did have a longer life expectancy in relation to the people with a lower life expectancy. There are some few outliers as you can see in the plot, but the vast majority had > 10 years of education. This could also be summarised within the linear model between Life.expectancy and Schooling (lm_life) and plotted in a graph:
lm_life <- lm(formula = Schooling ~ Life.expectancy, data = data_var)
summary(lm_life)
##
## Call:
## lm(formula = Schooling ~ Life.expectancy, data = data_var)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.4111 -1.1027 0.0164 1.2696 6.1774
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.643467 0.313559 -21.19 <2e-16 ***
## Life.expectancy 0.268828 0.004481 59.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.206 on 2766 degrees of freedom
## (170 observations deleted due to missingness)
## Multiple R-squared: 0.5655, Adjusted R-squared: 0.5653
## F-statistic: 3599 on 1 and 2766 DF, p-value: < 2.2e-16
effect_plot(lm_life, pred = Life.expectancy, interval = TRUE, partial.residuals = TRUE, colors = "magenta", point.color = "darkorchid4")
Random forest out of the box can’t handle missing values so we use the roughfix method which follows the following approach: “NAs are replaced with column medians …. This is used as a starting point for inputing missing values by random forest missfill= 1,2 does a fast replacement of the missing values, for the training set (if equal to 1) and a more careful replacement (if equal to 2). mfixrep= k with missfill=2 does a slower, but usually more effective, replacement using proximities with k iterations on the training set only. (Requires nprox >0).”
set.seed(1337)
smp_size <- floor(0.8 * nrow(life_expectancy))
train_ind <- sample(seq_len(nrow(life_expectancy)), size = smp_size)
df_train <- life_expectancy[train_ind,]
df_test <- life_expectancy[-train_ind,]
rf_model <- randomForest(Life.expectancy ~ .,
data = df_train %>% dplyr::select(!c(Country)),
na.action = na.roughfix
)
rf_model
##
## Call:
## randomForest(formula = Life.expectancy ~ ., data = df_train %>% dplyr::select(!c(Country)), na.action = na.roughfix)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 6
##
## Mean of squared residuals: 3.541
## % Var explained: 96.03
mse <- function(y_true, y_pred) {
mean((y_true - y_pred)^2)
}
imp <- data.frame(importance(rf_model)) %>% arrange(desc(IncNodePurity))
ggplot() +
geom_bar(aes(x=reorder(rownames(imp), -imp$IncNodePurity),y=imp$IncNodePurity), stat="identity") +
labs(y = "Feature Importance", x = "Feature") +
theme(axis.text.x=element_text(angle=45, hjust=1))
The random forest model has shown us that the schooling variable is not as important as some other variables. For example, the barplot shows that the number of HIV deaths plays a huge role as a predictor for life expectancy. The countries where HIV leads to huge numbers of deaths are generally underdeveloped and have worse living conditions, as well as schooling; which are possible contributors to the high HIV death tolls. The human development index serves as a statistic composition of life expectancy, education and per capita income. Therefore, it also makes sense that this is an important variable.
pred <- predict(rf_model, newdata = df_test)
ggplot(df_test) +
geom_hex(aes(x=pred, y = Life.expectancy), bins = 60) +
scale_fill_continuous(type = "viridis") +
theme_bw() +
geom_abline()
## Warning: Removed 268 rows containing non-finite values (stat_binhex).
The plot shows how well the predictions on the test set fit to the actual values. We see that most of the values are very close to the diagonal line, indicating a good fit of the model. The random tree model has not been able to predict the countries that have actually have a very high life expectancy.
pred_plot <- function(model) {
# Created a test set for lstat in order to make predictions
x_pred <- df_test$Schooling
y_pred <- predict(model, newdata = tibble(Schooling = x_pred))
# Plot with the real data and a prediction line
df_train %>%
ggplot(aes(x = Schooling , y = Life.expectancy)) +
geom_point(aes(color=Status)) +
geom_line(data = tibble(Schooling = x_pred, Life.expectancy = y_pred), size = 1, col = "red") +
theme_minimal()
}
pred_plot2 <- function(model) {
# Created a test set for lstat in order to make predictions
x_pred <- df_test$HIV.AIDS
y_pred <- predict(model, newdata = tibble(HIV.AIDS = x_pred))
# Plot with the real data and a prediction line
df_train %>%
ggplot(aes(x = HIV.AIDS , y = Life.expectancy)) +
geom_point(aes(color=Status)) +
geom_line(data = tibble(HIV.AIDS = x_pred, Life.expectancy = y_pred), size = 1, col = "red") +
theme_minimal()
}
#Linear regression life expectancy
lr_life <- lm(Life.expectancy ~ Schooling, data = df_train)
#pred_plot(lr_life)
#Polynomial 3rd degree
#pn3_life <- lm(Life.expectancy ~ poly(Schooling, 3), data = df_train)
pn3_life <- lm(Life.expectancy ~ Schooling + I(Schooling^2) + I(Schooling^3), data = df_train)
#pred_plot(pn3_life)
# Piecewise constant
# create new dataset without missing data
newdata <- na.omit(df_train)
sections <- c(-Inf, quantile(newdata$Schooling, probs = c(.2, .4, .6, .8)), Inf)
pwc_life <- lm(Life.expectancy ~ cut(Schooling, sections), data = newdata)
#pred_plot(pwc_life)
# Piecewise polynomial
piecewise_cubic_basis <- function(vec, knots = 1) {
# If there is only one section, just return the 3rd order polynomial
if (knots == 0) return(poly(vec, degree = 3, raw = TRUE))
# cut the vector
cut_vec <- cut(vec, breaks = knots + 1)
# initialise a matrix for the piecewise polynomial
out <- matrix(nrow = length(vec), ncol = 0)
# loop over the levels of the cut vector
for (lvl in levels(cut_vec)) {
# temporary vector
tmp <- vec
# set all values to 0 except the current section
tmp[cut_vec != lvl] <- 0
# add the polynomial based on this vector to the matrix
out <- cbind(out, poly(tmp, degree = 3, raw = TRUE))
}
out
}
pwp_life <- lm(Life.expectancy ~ piecewise_cubic_basis(Schooling, 3), data = df_train)
# Cubic spline
library(splines)
bs_life <- lm(Life.expectancy ~ bs(HIV.AIDS, knots = median(Schooling)), data = df_train)
#pred_plot2(bs_life)
# Natural spline
ns3_life <- lm(Life.expectancy ~ ns(HIV.AIDS, df = 3), data = df_train)
#pred_plot2(ns3_life)
library(cowplot)
plot_grid(
pred_plot(lr_life) + ggtitle("Linear regression"),
pred_plot(pn3_life) + ggtitle("Polynomial"),
pred_plot(pwc_life) + ggtitle("Piecewise constant"),
pred_plot(pwp_life) + ggtitle("Piecewise polynomial")
)
## Warning: Removed 138 rows containing missing values (geom_point).
## Warning: Removed 30 row(s) containing missing values (geom_path).
## Warning: Removed 138 rows containing missing values (geom_point).
## Warning: Removed 30 row(s) containing missing values (geom_path).
## Warning: Removed 138 rows containing missing values (geom_point).
## Warning: Removed 30 row(s) containing missing values (geom_path).
## Warning: Removed 138 rows containing missing values (geom_point).
## Warning: Removed 30 row(s) containing missing values (geom_path).
With this grid graph we can see that overall developed countries have more years of schooling and thus, more life expectancy.
plot_grid(
pred_plot2(bs_life) + ggtitle("Cubic spline"),
pred_plot2(ns3_life) + ggtitle("Natural spline")
)
## Warning in predict.lm(model, newdata = tibble(HIV.AIDS = x_pred)): prediction
## from a rank-deficient fit may be misleading
## Warning: Removed 8 rows containing missing values (geom_point).
## Warning: Removed 8 rows containing missing values (geom_point).
On the contrary, populations with high rate of HIV are not even registered and the prediction becomes unclear but we can observe that the points indicate low life expectancy in that case. The best life prediction is seen at 0% HIV, where developed countries are placed.